Deep Learning


Roadmap 
1. Embedding Numerous Features: Kernel Models
2. Combining Predictive Features: Aggregation Models
3. Distilling Implicit Features: Extraction Models









Lecture 12: Neural Network 


automatic pattern feature extraction from layers of 
neurons with backprop for GD/SGD 





Lecture 13: Deep Learning 


• Deep Neural Network
• Autoencoder
• Denoising Autoencoder
• Principal Component Analysis
Deep Learning Deep Neural Network


Physical Interpretation of NNet Revisited 





• each layer: pattern feature extracted from data, remember? :-)
• how many neurons? how many layers?
  more generally, what structure?
  • subjectively, your design!
  • objectively, validation, maybe?





structural decisions:
key issue for applying NNet


Shallow versus Deep Neural Networks 


shallow: few (hidden) layers; deep: many layers 









Shallow NNet
• more efficient to train (O)
• simpler structural decisions (O)
• theoretically powerful enough (O)

Deep NNet
• challenging to train (×)
• sophisticated structural decisions (×)
• ‘arbitrarily’ powerful (O)
• more ‘meaningful’? (see next slide)

deep NNet (deep learning)
gaining attention in recent years


Meaningfulness of Deep Learning 


[Figure: network diagram with positive-weight and negative-weight connections]

• ‘less burden’ for each layer: simple to complex features
• natural for difficult learning task with raw features, like vision

deep NNet: currently popular in
vision/speech/...


Challenges and Key Techniques for Deep Learning 


• difficult structural decisions:
  • subjective with domain knowledge: like convolutional NNet for images
• high model complexity:
  • no big worries if big enough data
  • regularization towards noise-tolerant: like
    • dropout (tolerant when network corrupted)
    • denoising (tolerant when input corrupted)
• hard optimization problem:
  • careful initialization to avoid bad local minimum: called pre-training
• huge computational complexity (worsens with big data):
  • novel hardware/architecture: like mini-batch with GPU

IMHO, careful regularization and
initialization are key techniques
A Two-Step Deep Learning Framework 
Simple Deep Learning 


1. for $\ell = 1, \ldots, L$, pre-train $\{w_{ij}^{(\ell)}\}$ assuming $w_{*}^{(1)}, \ldots, w_{*}^{(\ell-1)}$ fixed
2. train with backprop on pre-trained NNet to fine-tune all $\{w_{ij}^{(\ell)}\}$


will focus on simplest pre-training technique
along with regularization


Fun Time

For a deep NNet for written character recognition from raw pixels,
which type of features are more likely extracted after the first hidden
layer?

1. pixels
2. strokes
3. parts
4. digits

Reference Answer: (2)

Simple strokes are likely the ‘next-level’
features that can be extracted from raw pixels.
Deep Learning Autoencoder


Information-Preserving Encoding 


• weights: feature transform, i.e. encoding
• good weights: information-preserving encoding
  (next layer: same info. with different representation)
• information-preserving:
  decode accurately after encoding











idea: pre-train weights towards 
information-preserving encoding 
Information-Preserving Neural Network 


[Figure: autoencoder network diagram]

• autoencoder: $d$-$\tilde{d}$-$d$ NNet with goal $g_i(\mathbf{x}) \approx x_i$
  (learning to approximate identity function)
• $w_{ij}^{(1)}$: encoding weights; $w_{ji}^{(2)}$: decoding weights





why approximating identity function? 


Usefulness of Approximating Identity Function 


if $g(\mathbf{x}) \approx \mathbf{x}$ using some hidden structures on the observed data $\mathbf{x}_n$,
• for supervised learning:
  • hidden structure (essence) of $\mathbf{x}$ can be used as reasonable transform $\Phi(\mathbf{x})$
  (learning ‘informative’ representation of data)
• for unsupervised learning:
  • density estimation: larger (structure match) when $g(\mathbf{x}) \approx \mathbf{x}$
  • outlier detection: those $\mathbf{x}$ where $g(\mathbf{x}) \not\approx \mathbf{x}$
  (learning ‘typical’ representation of data)

autoencoder:
representation-learning through
approximating identity function


Basic Autoencoder 


basic autoencoder:
$d$-$\tilde{d}$-$d$ NNet with error function $\sum_{i=1}^{d} (g_i(\mathbf{x}) - x_i)^2$

• backprop easily applies; shallow and easy to train
• usually $\tilde{d} < d$: compressed representation
• data: $\{(\mathbf{x}_1, \mathbf{y}_1 = \mathbf{x}_1), (\mathbf{x}_2, \mathbf{y}_2 = \mathbf{x}_2), \ldots, (\mathbf{x}_N, \mathbf{y}_N = \mathbf{x}_N)\}$
  (often categorized as unsupervised learning technique)
• sometimes constrain $w_{ij}^{(1)} = w_{ji}^{(2)}$ as regularization
  (more sophisticated in calculating gradient)

basic autoencoder in basic deep learning:
$\{w_{ij}^{(1)}\}$ taken as shallowly pre-trained weights
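
A minimal NumPy sketch of the basic autoencoder above may make the training loop concrete. The slides do not fix the details, so several choices here are assumptions for illustration: tanh hidden units, linear output units, full-batch gradient descent, and the hypothetical names `train_autoencoder`, `lr`, and `epochs`.

```python
import numpy as np

def train_autoencoder(X, d_tilde, lr=0.02, epochs=500, seed=0):
    """Train a d-d_tilde-d autoencoder by full-batch gradient descent.

    Assumed setup (not specified on the slides): tanh hidden layer,
    linear output layer, squared reconstruction error averaged over N.
    """
    rng = np.random.default_rng(seed)
    N, d = X.shape
    W1 = rng.normal(scale=0.1, size=(d, d_tilde))  # encoding weights
    b1 = np.zeros(d_tilde)
    W2 = rng.normal(scale=0.1, size=(d_tilde, d))  # decoding weights
    b2 = np.zeros(d)
    history = []
    for _ in range(epochs):
        H = np.tanh(X @ W1 + b1)   # encoded representation
        G = H @ W2 + b2            # reconstruction g(x)
        R = G - X                  # residual g(x) - x
        history.append(np.mean(np.sum(R ** 2, axis=1)))
        # Backprop of the averaged squared error.
        dG = 2.0 * R / N
        dW2 = H.T @ dG
        db2 = dG.sum(axis=0)
        dS = (dG @ W2.T) * (1.0 - H ** 2)  # through tanh
        dW1 = X.T @ dS
        db1 = dS.sum(axis=0)
        W1 -= lr * dW1
        b1 -= lr * db1
        W2 -= lr * dW2
        b2 -= lr * db2
    return (W1, b1, W2, b2), history
```

After training, `W1` plays the role of the shallowly pre-trained encoding weights; `history` records the reconstruction error per epoch so that convergence can be inspected.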


Pre-Training with Autoencoders
Deep Learning with Autoencoders

1. for $\ell = 1, \ldots, L$, pre-train $\{w_{ij}^{(\ell)}\}$ assuming $w_{*}^{(1)}, \ldots, w_{*}^{(\ell-1)}$ fixed,
   by training basic autoencoder on $\{\mathbf{x}_n^{(\ell-1)}\}$ with $\tilde{d} = d^{(\ell)}$
2. train with backprop on pre-trained NNet to fine-tune all $\{w_{ij}^{(\ell)}\}$


many successful pre-training techniques take 
‘fancier’ autoencoders with different 
architectures and regularization schemes 
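
The layer-by-layer pre-training step can be sketched as follows. This is only an illustration under assumptions: tanh units without bias terms, plain gradient descent, and hypothetical helper names `pretrain_layer` and `pretrain_stack`. The fine-tuning step with backprop is not shown.

```python
import numpy as np

def pretrain_layer(X, d_out, lr=0.01, epochs=200, seed=0):
    """Train one d-d_out-d autoencoder on X and keep only the encoder."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    W_enc = rng.normal(scale=0.1, size=(d, d_out))
    W_dec = rng.normal(scale=0.1, size=(d_out, d))
    for _ in range(epochs):
        H = np.tanh(X @ W_enc)
        dG = 2.0 * (H @ W_dec - X) / N       # gradient at the output
        dS = (dG @ W_dec.T) * (1.0 - H ** 2)  # gradient before tanh
        W_dec -= lr * (H.T @ dG)
        W_enc -= lr * (X.T @ dS)
    return W_enc

def pretrain_stack(X, layer_sizes, **kwargs):
    """Pre-train layers one at a time, keeping earlier layers fixed."""
    weights, inputs = [], X
    for d_out in layer_sizes:
        W = pretrain_layer(inputs, d_out, **kwargs)
        weights.append(W)
        # Outputs of this layer become the data for the next autoencoder.
        inputs = np.tanh(inputs @ W)
    return weights
```

For example, `pretrain_stack(X, [5, 3])` on 8-dimensional inputs would return encoder matrices of shapes (8, 5) and (5, 3), which could then initialize a deeper network before fine-tuning.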
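Fun Time

Suppose training a $d$-$\tilde{d}$-$d$ autoencoder with backprop takes
approximately $c \cdot d \cdot \tilde{d}$ seconds. Then, what is the total number of
seconds needed for pre-training a $d$-$d^{(1)}$-$d^{(2)}$-$d^{(3)}$-$1$ deep NNet?

1. $c \left( d + d^{(1)} + d^{(2)} + d^{(3)} + 1 \right)$
2. $c \left( d \cdot d^{(1)} \cdot d^{(2)} \cdot d^{(3)} \cdot 1 \right)$
3. $c \left( d d^{(1)} + d^{(1)} d^{(2)} + d^{(2)} d^{(3)} + d^{(3)} \right)$
4. $c \left( d d^{(1)} \cdot d^{(1)} d^{(2)} \cdot d^{(2)} d^{(3)} \cdot d^{(3)} \right)$

Reference Answer: (3)

Each $c \cdot d^{(\ell-1)} \cdot d^{(\ell)}$ represents the time for
pre-training with one autoencoder to determine
one layer of the weights.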
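
The arithmetic behind the reference answer can be checked with a few lines of Python. The function name `pretraining_seconds` and the example layer sizes are hypothetical; the cost model is the one stated in the question.

```python
def pretraining_seconds(c, dims):
    """Total pre-training time for a NNet with layer sizes `dims`.

    `dims` lists every layer, e.g. [d, d1, d2, d3, 1]. Each adjacent pair
    (d_prev, d_next) costs c * d_prev * d_next seconds, one autoencoder
    per layer of weights.
    """
    return c * sum(a * b for a, b in zip(dims[:-1], dims[1:]))

# Example with assumed sizes: a 10-5-4-3-1 network and c = 1.
total = pretraining_seconds(1, [10, 5, 4, 3, 1])  # 10*5 + 5*4 + 4*3 + 3*1
```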
Deep Learning Denoising Autoencoder


Regularization in Deep Learning 








high model complexity: regularization needed
• structural decisions/constraints
• weight decay or weight elimination regularizers
• early stopping





next: another regularization technique 


Reasons of Overfitting Revisited 


[Figure: overfit level plotted against Number of Data Points, N and Noise Level, σ²]

reasons of serious overfitting:
  data size N ↓        overfit ↑
  noise ↑              overfit ↑
  excessive power ↑    overfit ↑

how to deal with noise?

Dealing with Noise 
• direct possibility: data cleaning/pruning, remember? :-)
• a wild possibility: adding noise to data?
• idea: robust autoencoder should not only let $g(\mathbf{x}) \approx \mathbf{x}$
  but also allow $g(\tilde{\mathbf{x}}) \approx \mathbf{x}$ even when $\tilde{\mathbf{x}}$ slightly different from $\mathbf{x}$
• denoising autoencoder:
  run basic autoencoder with data
  $\{(\tilde{\mathbf{x}}_1, \mathbf{y}_1 = \mathbf{x}_1), (\tilde{\mathbf{x}}_2, \mathbf{y}_2 = \mathbf{x}_2), \ldots, (\tilde{\mathbf{x}}_N, \mathbf{y}_N = \mathbf{x}_N)\}$,
  where $\tilde{\mathbf{x}}_n = \mathbf{x}_n +$ artificial noise
  (often used instead of basic autoencoder in deep learning)
• useful for data/image processing: $g(\tilde{\mathbf{x}})$ a denoised version of $\tilde{\mathbf{x}}$
• effect: ‘constrain/regularize’ $g$ towards noise-tolerant denoising

artificial noise/hint as regularization!
(practically also useful for other NNet/models)
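
Constructing the denoising training set is simple enough to sketch directly. The slides only say "artificial noise", so the additive Gaussian noise, the `noise_std` parameter, and the function name `make_denoising_pairs` are assumptions for illustration.

```python
import numpy as np

def make_denoising_pairs(X, noise_std=0.1, seed=0):
    """Return (noisy inputs, clean targets) for a denoising autoencoder.

    Inputs are x_n plus artificial noise (assumed Gaussian here);
    targets are the original, uncorrupted x_n.
    """
    rng = np.random.default_rng(seed)
    X_noisy = X + rng.normal(scale=noise_std, size=X.shape)
    return X_noisy, X
```

Any autoencoder trainer that accepts separate inputs and targets could then be run on these pairs; drawing fresh noise every epoch is a common variation, though the slides do not specify it.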
Fun Time

Which of the following cannot be viewed as a regularization technique?

1. hint the model with artificially-generated noisy data
2. stop gradient descent early
3. add a weight elimination regularizer
4. all the above are regularization techniques

Reference Answer: (4)

(1) is our new friend for regularization, while
(2) and (3) are old friends.
Deep Learning Principal Component Analysis


Linear Autoencoder Hypothesis 


nonlinear autoencoder: sophisticated
linear autoencoder: simple

linear: more efficient? less overfitting? linear first, remember? :-)

linear hypothesis for $k$-th component: $h_k(\mathbf{x}) = \sum_{j=0}^{\tilde{d}} w_{jk}^{(2)} \left( \sum_{i=0}^{d} w_{ij}^{(1)} x_i \right)$

consider three special conditions:
• exclude $x_0$: range of $i$ same as range of $k$
• constrain $w_{ij}^{(1)} = w_{ji}^{(2)} = w_{ij}$: regularization
  (denote $\mathrm{W} = [w_{ij}]$ of size $d \times \tilde{d}$)
• assume $\tilde{d} < d$: ensure non-trivial solution

linear autoencoder hypothesis:
$\mathbf{h}(\mathbf{x}) = \mathrm{W}\mathrm{W}^T\mathbf{x}$


Linear Autoencoder Error Function 






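$E_{\text{in}}(\mathbf{h}) = E_{\text{in}}(\mathrm{W}) = \frac{1}{N} \sum_{n=1}^{N} \left\| \mathbf{x}_n - \mathrm{W}\mathrm{W}^T \mathbf{x}_n \right\|^2$ with $d \times \tilde{d}$ matrix $\mathrm{W}$

analytic solution to minimize $E_{\text{in}}$? but 4-th order polynomial of $w_{ij}$

let’s familiarize the problem with linear algebra (be brave! :-))

• eigen-decompose $\mathrm{W}\mathrm{W}^T = \mathrm{V} \Gamma \mathrm{V}^T$
  • $d \times d$ matrix $\mathrm{V}$ orthogonal: $\mathrm{V}\mathrm{V}^T = \mathrm{V}^T\mathrm{V} = \mathrm{I}_d$
  • $d \times d$ matrix $\Gamma$ diagonal with $\le \tilde{d}$ non-zero
• $\mathrm{W}\mathrm{W}^T \mathbf{x}_n = \mathrm{V} \Gamma \mathrm{V}^T \mathbf{x}_n$
  • $\mathrm{V}^T(\mathbf{x}_n)$: change of orthonormal basis (rotate or reflect)
  • $\Gamma(\cdots)$: set $\ge d - \tilde{d}$ components to 0, and scale others
  • $\mathrm{V}(\cdots)$: reconstruct by coefficients and basis (back-rotate)
• $\mathbf{x}_n = \mathrm{V} \mathrm{I} \mathrm{V}^T \mathbf{x}_n$: rotate and back-rotate cancel out

next: minimize $E_{\text{in}}$ by optimizing $\Gamma$ and $\mathrm{V}$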


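The Optimal $\Gamma$

$$\min_{\mathrm{V}} \min_{\Gamma} \frac{1}{N} \sum_{n=1}^{N} \Big\| \underbrace{\mathrm{V} \mathrm{I} \mathrm{V}^T \mathbf{x}_n}_{\mathbf{x}_n} - \underbrace{\mathrm{V} \Gamma \mathrm{V}^T \mathbf{x}_n}_{\mathrm{W}\mathrm{W}^T \mathbf{x}_n} \Big\|^2$$

• back-rotate not affecting length: the leading $\mathrm{V}$ can be dropped
• $\min_{\Gamma} \sum \left\| (\mathrm{I} - \Gamma)(\text{some vector}) \right\|^2$: want many 0 within $(\mathrm{I} - \Gamma)$
• optimal diagonal $\Gamma$ with rank $\le \tilde{d}$:
  $\tilde{d}$ diagonal components 1, other components 0;
  without loss of generality, $\Gamma = \begin{bmatrix} \mathrm{I}_{\tilde{d}} & 0 \\ 0 & 0 \end{bmatrix}$

next: $\min_{\mathrm{V}} \sum_{n=1}^{N} \left\| \begin{bmatrix} 0 & 0 \\ 0 & \mathrm{I}_{d - \tilde{d}} \end{bmatrix} \mathrm{V}^T \mathbf{x}_n \right\|^2$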


The Optimal V 


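$$\min_{\mathrm{V}} \sum_{n=1}^{N} \left\| \begin{bmatrix} 0 & 0 \\ 0 & \mathrm{I}_{d - \tilde{d}} \end{bmatrix} \mathrm{V}^T \mathbf{x}_n \right\|^2 \quad \equiv \quad \max_{\mathrm{V}} \sum_{n=1}^{N} \left\| \begin{bmatrix} \mathrm{I}_{\tilde{d}} & 0 \\ 0 & 0 \end{bmatrix} \mathrm{V}^T \mathbf{x}_n \right\|^2$$

• $\tilde{d} = 1$: only first row $\mathbf{v}^T$ of $\mathrm{V}^T$ matters
  $\max_{\mathbf{v}} \sum_{n=1}^{N} \mathbf{v}^T \mathbf{x}_n \mathbf{x}_n^T \mathbf{v}$ subject to $\mathbf{v}^T \mathbf{v} = 1$
  • optimal $\mathbf{v}$ satisfies $\sum_{n=1}^{N} \mathbf{x}_n \mathbf{x}_n^T \mathbf{v} = \lambda \mathbf{v}$
    (using Lagrange multiplier $\lambda$, remember? :-))
  • optimal $\mathbf{v}$: ‘topmost’ eigenvector of $\mathrm{X}^T \mathrm{X}$
• general $\tilde{d}$: $\{\mathbf{v}_j\}_{j=1}^{\tilde{d}}$ ‘topmost’ eigenvectorS of $\mathrm{X}^T \mathrm{X}$
  (optimal $\{\mathbf{w}_j\} = \{\mathbf{v}_j \text{ with } [\![\gamma_j = 1]\!]\}$ = top eigenvectors)

linear autoencoder: projecting to orthogonal
patterns $\mathbf{w}_j$ that ‘matches’ $\{\mathbf{x}_n\}$ most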
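
A quick numerical check of this result: with orthonormal columns in W, the reconstruction error of W W^T x should be no larger for the top eigenvectors of X^T X than for any other orthonormal basis. The sketch below compares against one random basis; the function names are hypothetical and NumPy is an assumed tool.

```python
import numpy as np

def recon_error(X, W):
    """Average squared error of x_n versus W W^T x_n (rows of X are x_n)."""
    R = X - X @ W @ W.T
    return np.mean(np.sum(R ** 2, axis=1))

def top_eigvec_basis(X, d_tilde):
    """Columns are the d_tilde top eigenvectors of X^T X."""
    eigvals, eigvecs = np.linalg.eigh(X.T @ X)  # eigenvalues ascending
    return eigvecs[:, ::-1][:, :d_tilde]

def random_orthonormal_basis(d, d_tilde, seed=0):
    """A random d x d_tilde matrix with orthonormal columns, via QR."""
    rng = np.random.default_rng(seed)
    Q, _ = np.linalg.qr(rng.normal(size=(d, d_tilde)))
    return Q
```

Comparing `recon_error(X, top_eigvec_basis(X, k))` with `recon_error(X, random_orthonormal_basis(d, k))` on any data matrix should show the eigenvector basis doing at least as well.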


Principal Component Analysis 








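Linear Autoencoder or PCA

1. let $\bar{\mathbf{x}} = \frac{1}{N} \sum_{n=1}^{N} \mathbf{x}_n$, and let $\mathbf{x}_n \leftarrow \mathbf{x}_n - \bar{\mathbf{x}}$
2. calculate $\tilde{d}$ top eigenvectors $\mathbf{w}_1, \mathbf{w}_2, \ldots, \mathbf{w}_{\tilde{d}}$ of $\mathrm{X}^T \mathrm{X}$
3. return feature transform $\Phi(\mathbf{x}) = \mathrm{W}^T(\mathbf{x} - \bar{\mathbf{x}})$, with the $\mathbf{w}_j$ as columns of the $d \times \tilde{d}$ matrix $\mathrm{W}$

• linear autoencoder:
  maximize $\sum (\text{magnitude after projection})^2$
• principal component analysis (PCA) from statistics:
  maximize $\sum (\text{variance after projection})$
• both useful for linear dimension reduction,
  though PCA more popular

linear dimension reduction:
useful for data processing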
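
The three PCA steps translate almost line by line into NumPy. This is a minimal sketch, not a tuned implementation: the names `pca_fit` and `pca_transform` are hypothetical, and the eigenvectors are stored as columns of a d × d̃ matrix, so the transform multiplies centered rows by that matrix.

```python
import numpy as np

def pca_fit(X, d_tilde):
    """Return the data mean and the d_tilde top eigenvectors of X^T X."""
    x_bar = X.mean(axis=0)                       # step 1: mean
    Xc = X - x_bar                               # step 1: centering
    eigvals, eigvecs = np.linalg.eigh(Xc.T @ Xc) # step 2: ascending order
    order = np.argsort(eigvals)[::-1][:d_tilde]  # indices of top eigenvalues
    W = eigvecs[:, order]                        # d x d_tilde, columns w_j
    return x_bar, W

def pca_transform(X, x_bar, W):
    """Step 3: project centered data onto the top eigenvectors."""
    return (X - x_bar) @ W
```

For data that lies exactly in a low-dimensional affine subspace, `pca_transform(X, x_bar, W) @ W.T + x_bar` recovers `X` up to floating-point error; for general data it gives the best linear reconstruction in the squared-error sense.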
Fun Time 


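When solving the optimization problem
$\max_{\mathbf{v}} \sum_{n=1}^{N} \mathbf{v}^T \mathbf{x}_n \mathbf{x}_n^T \mathbf{v}$ subject to $\mathbf{v}^T \mathbf{v} = 1$,

we know that the optimal $\mathbf{v}$ is the ‘topmost’ eigenvector that
corresponds to the ‘topmost’ eigenvalue $\lambda$ of $\mathrm{X}^T \mathrm{X}$. Then, what is the
optimal objective value of the optimization problem?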


[four answer choices, garbled in the source]


Reference Answer: (4) 


The objective value of the optimization problem
is simply $\mathbf{v}^T \mathrm{X}^T \mathrm{X} \mathbf{v}$, which is $\lambda \mathbf{v}^T \mathbf{v}$, and you
know what $\mathbf{v}^T \mathbf{v}$ must be.
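
The explanation can be verified numerically. The sketch below, with the hypothetical helper name `top_objective` and random data as an assumed example, evaluates the objective at the topmost eigenvector and returns it alongside the topmost eigenvalue and the squared length of v.

```python
import numpy as np

def top_objective(X):
    """Evaluate sum_n v^T x_n x_n^T v at the topmost eigenvector of X^T X."""
    eigvals, eigvecs = np.linalg.eigh(X.T @ X)  # eigenvalues ascending
    lam, v = eigvals[-1], eigvecs[:, -1]        # topmost pair
    objective = sum(v @ np.outer(x, x) @ v for x in X)
    return objective, lam, v @ v
```

Calling `top_objective` on any data matrix should return an objective that matches the topmost eigenvalue times the squared length of v, as the explanation states.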
Summary

1. Embedding Numerous Features: Kernel Models
2. Combining Predictive Features: Aggregation Models
3. Distilling Implicit Features: Extraction Models

Lecture 13: Deep Learning
• Deep Neural Network
  difficult hierarchical feature extraction problem
• Autoencoder
  unsupervised NNet learning of representation
• Denoising Autoencoder
  using noise as hints for regularization
• Principal Component Analysis
  linear autoencoder variant for data processing

• next: extracting ‘prototype’ instead of pattern


